Objectives

o Develop a fluid code on the GPU for modeling flows with
  complex chemical kinetics. The entire code is written in
  CUDA C/C++ for maximum flexibility.
o Explore different strategies for optimizing the performance of
  the code for a general chemistry mechanism.
o Emphasize the kinetics solver, since it is the more
  computationally expensive part.
o Benchmark with standard test cases.
Motivation

o Detailed study of non-equilibrium processes associated with
  high-speed flow:
    o Detonation instability
    o Partially ionized gas
    o MHD
o Development of a multi-physics code using object-oriented
  and CUDA technology. Both features are available in
  CUDA C/C++.
Governing Equations

Euler equations with source term for chemical kinetics
Solution method:

o Finite volume method for hyperbolic conservation laws
o Source terms are solved using an operator-splitting technique
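
The operator-splitting idea above can be sketched in a few lines of host-side C++. This is a minimal illustration, not the authors' code: it assumes first-order (Lie) splitting, since the slides do not say which splitting order is used, and `SubStep`, `split_step`, `flux_step` and `source_step` are hypothetical names. A scalar state stands in for the full vector of conserved variables.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sub-stepper signature: (state, dt) -> new state.
using SubStep = std::function<double(double, double)>;

// One first-order (Lie) operator-splitting step. ASSUMPTION: the slides do
// not state the splitting order; Strang splitting would use half steps.
double split_step(double q, double dt,
                  const SubStep& flux_step,     // homogeneous (finite-volume) update
                  const SubStep& source_step) { // chemical-kinetics update
    assert(dt > 0.0);
    const double q_star = flux_step(q, dt); // advance without source terms
    return source_step(q_star, dt);         // advance the source from q_star
}
```

The point of the design is that the two sub-steps can use different integrators: an explicit scheme for the flux part and an implicit one for the stiff kinetics.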
Numerical Schemes

o Monotonicity-Preserving [1] (MP) schemes
    o 3rd- and 5th-order spatial discretizations were used in
      conjunction with 3rd-order TVD Runge-Kutta time integration
o Arbitrary Derivative Riemann solver with Weighted Essentially
  Non-Oscillatory [2] (ADER-WENO) scheme
    o 5th order in space and 3rd order in time, without Runge-Kutta
      time integration
    o Uses the Cauchy-Kowalewski procedure and Taylor series
      expansion of WENO fluxes for high order in time

[1] Suresh & Huynh (1997) J. Comp. Phys. 136, 83-99
[2] Titarev & Toro (2001) J. Comp. Phys. 204, 715-736
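
For reference, a 3rd-order TVD Runge-Kutta step can be written as three forward-Euler stages combined convexly. The sketch below uses the standard Shu-Osher coefficients; the slides do not list the coefficients, so treating this as the variant used is an assumption. `State`, `Rhs`, `euler` and `tvd_rk3_step` are hypothetical names, and `L` stands for whatever spatial operator (for example an MP reconstruction plus flux evaluation) supplies the semi-discrete right-hand side.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using State = std::vector<double>;
// Hypothetical spatial operator: returns L(u), the semi-discrete right-hand side.
using Rhs = std::function<State(const State&)>;

// Forward-Euler building block: returns u + dt * L(u).
static State euler(const State& u, double dt, const Rhs& L) {
    State r = L(u);
    assert(r.size() == u.size());
    for (std::size_t i = 0; i < u.size(); ++i) r[i] = u[i] + dt * r[i];
    return r;
}

// One 3rd-order TVD Runge-Kutta step (Shu-Osher form, assumed here).
State tvd_rk3_step(const State& u, double dt, const Rhs& L) {
    const State u1 = euler(u, dt, L);                 // stage 1
    State u2 = euler(u1, dt, L);                      // stage 2
    for (std::size_t i = 0; i < u.size(); ++i)
        u2[i] = 0.75 * u[i] + 0.25 * u2[i];
    State u3 = euler(u2, dt, L);                      // stage 3
    for (std::size_t i = 0; i < u.size(); ++i)
        u3[i] = u[i] / 3.0 + 2.0 * u3[i] / 3.0;
    return u3;
}
```

Each step calls `L` three times, which is the cost that a single-step scheme such as ADER-WENO avoids.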
Chemical Kinetics

Implicit formulation

$$\frac{dQ}{dt} = \dot{\Omega} \quad\Rightarrow\quad \left(I - \Delta t\,\frac{\partial \dot{\Omega}}{\partial Q}\right)\frac{dQ}{dt} = \dot{\Omega}$$
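
To make the implicit formulation concrete, here is a scalar sketch of one linearized-implicit update: the source term is linearized about the current state, which gives (1 - dt*J) dQ/dt = omega with J the source Jacobian. The function name and the scalar setting are illustrative assumptions. With several species, the factor becomes a matrix and each cell requires a linear solve.

```cpp
#include <cassert>

// One linearized-implicit update for a scalar model source term.
// omega is the source evaluated at q; jac is d(omega)/dq at q.
// ASSUMPTION: scalar stand-in for the per-cell species system.
double implicit_source_step(double q, double dt, double omega, double jac) {
    const double denom = 1.0 - dt * jac;  // scalar analogue of (I - dt * dOmega/dQ)
    assert(denom != 0.0);
    const double dqdt = omega / denom;    // solve the (here 1x1) linear system
    return q + dt * dqdt;
}
```

For a stiff decay omega = -100 q with q = 1 and dt = 0.1, this update returns about 0.0909, whereas an explicit Euler step would overshoot to -9.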


Elementary Reaction:
Species production/destruction rate
Graphics Processing Unit

What is a GPU?

o A graphics processing unit contains a massive number of
  processing cores
o Designed specifically for graphics rendering, which is a highly
  parallel process

Why GPU?

o GPUs are faster than CPUs for workloads that fit the SIMD
  execution model
o GPUs are now easy to program
o GPUs are much cheaper than CPUs for the same floating-point
  throughput
GPU versus CPU

Figure: Single- and double-precision floating-point capability of
GPUs and CPUs, 2003-2010
GPU Programming

Programming languages for GPU: CUDA, OpenCL,
DirectCompute, BrookGPU, ...

CUDA is the most mature programming environment for GPUs:
o similar to C/C++
o supports OO features
o easy to debug
GPU Programming Model

o Each device contains a set of streaming multiprocessors (SMs).
  Each SM contains a set of streaming processors (SPs).
o Parallelism is organized as a grid of thread blocks
o The code executed by the threads is called a kernel

Figure: CUDA thread and memory hierarchy: thread (per-thread private
local memory), thread block (per-block shared memory), grid 0
(application context, global memory)


Distribution A: Approved for Public Release; Distribution Unlimited

H. P. Le and J.-L. Cambi.


CFD

CFD:

o Cell-based parallelization: EOS, time marching, etc.
o Face-based parallelization: reconstruction, flux, etc.

Strategies:

o Global memory
    o large but high latency: requires coalesced access
o Shared memory
    o small but very fast; not useful in this case since Nq ~ Ns
o Reduce block occupancy to use more registers [3].

[3] Volkov (2010) Better Performance at Lower Occupancy, GPU Tech. Conf.
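
The coalescing requirement comes down to how a (variable, cell) pair maps to a memory address. The sketch below contrasts two common layouts and assumes one thread per cell. The names `soa_index` and `aos_index` are hypothetical, and the slides do not state which layout the code actually uses.

```cpp
#include <cassert>
#include <cstddef>

// Variable-major ("structure of arrays") layout: for a fixed variable,
// neighbouring cells are adjacent in memory, so threads handling
// neighbouring cells read neighbouring addresses (coalesced).
std::size_t soa_index(std::size_t var, std::size_t cell, std::size_t n_cells) {
    assert(cell < n_cells);
    return var * n_cells + cell;
}

// Cell-major ("array of structures") layout: neighbouring cells are
// n_vars elements apart, so the same access pattern is strided.
std::size_t aos_index(std::size_t var, std::size_t cell, std::size_t n_vars) {
    assert(var < n_vars);
    return cell * n_vars + var;
}
```

With 8 variables per cell, adjacent cells are 1 element apart in the first layout and 8 elements apart in the second.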
Chemical Kinetics

Main strategies:

o Coalesce memory access pattern for high global memory
  bandwidth
o Use shared memory to reduce DRAM latency
o Texture binding for read-only data

Issues:

o How to overcome the shared memory limitation?
o How effective is global memory?
Summary of Steps in Gaussian Elimination Algorithm

o Forward elimination:
    for np = 1:N-1
      for ns = np+1:N
        P := A(ns,np)/A(np,np)
        RHS(ns) := RHS(ns) - RHS(np)*P
        for ms = np+1:N
          A(ns,ms) := A(ns,ms) - A(np,ms)*P

o Backward substitution:
    for np = N down to 1
      P := 0
      for ns = np+1:N
        P := P + A(np,ns)*RHS(ns)
      RHS(np) := (RHS(np) - P)/A(np,np)
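
The two phases above translate directly into a serial C++ reference routine. This is a host-side sketch for checking results, not the GPU kernel. It assumes row-major storage, zero-based indices, and nonzero pivots (no row exchanges, as in the pseudocode). The name `gauss_solve` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves A x = rhs in place by Gaussian elimination without pivoting.
// A is row-major N x N; on return rhs holds the solution x.
// ASSUMPTION: every pivot A(np,np) is nonzero.
void gauss_solve(std::vector<double>& A, std::vector<double>& rhs) {
    const std::size_t N = rhs.size();
    assert(A.size() == N * N);
    // Forward elimination: remove column np from all rows below row np.
    for (std::size_t np = 0; np + 1 < N; ++np) {
        for (std::size_t ns = np + 1; ns < N; ++ns) {
            const double P = A[ns * N + np] / A[np * N + np];
            rhs[ns] -= rhs[np] * P;
            for (std::size_t ms = np + 1; ms < N; ++ms)
                A[ns * N + ms] -= A[np * N + ms] * P;
        }
    }
    // Backward substitution: last row first.
    for (std::size_t np = N; np-- > 0;) {
        double P = 0.0;
        for (std::size_t ns = np + 1; ns < N; ++ns)
            P += A[np * N + ns] * rhs[ns];
        rhs[np] = (rhs[np] - P) / A[np * N + np];
    }
}
```

For the 2 x 2 system 2x + y = 3, x + 3y = 5, the routine leaves x = 0.8 and y = 1.4 in `rhs`.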
Shared Memory Limit

How many kinetics systems can we put in shared memory (48
KB per CUDA block)?
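
A back-of-the-envelope count helps frame the question. The sketch below assumes each system stores a dense N x N matrix plus one right-hand-side vector and nothing else, which is a simplification. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of dense N x N systems (matrix + right-hand side) that fit in a
// given shared-memory budget. ASSUMPTION: no other shared-memory usage.
std::size_t systems_per_block(std::size_t n, std::size_t bytes_per_value,
                              std::size_t shared_bytes) {
    assert(n > 0 && bytes_per_value > 0);
    const std::size_t bytes_per_system = (n * n + n) * bytes_per_value;
    return shared_bytes / bytes_per_system;
}
```

Under these assumptions, with double precision and 48 KB (49152 bytes), 68 systems of size N = 9 fit in one block, a single system fits up to N = 77, and none fits from N = 78 onward. This is the limitation that motivates storing only part of the matrix in shared memory.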
Reduced Storage Pattern

Store one row of the matrix in shared memory for each row elimination
Algorithms

o Algorithm 1: store matrix data in global memory and
  coalesce the memory access pattern
    o Invert several matrices per CUDA block
o Algorithm 2: store part of the matrix data (one row at a time) in
  shared memory
    o Load and reload after row pivoting
    o Invert one matrix per CUDA block
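
The one-row-at-a-time idea behind Algorithm 2 can be illustrated serially. In the sketch below, the current pivot row is copied into a small buffer before the rows beneath it are updated, and the buffer is reloaded for each new pivot. On the GPU the buffer would sit in shared memory; here it is an ordinary vector. This is an interpretation of the slide, not the authors' kernel, and `eliminate_with_row_cache` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward elimination (no pivoting) that stages the pivot row and its
// right-hand-side entry in a buffer of N + 1 values.
// ASSUMPTION: the buffer stands in for per-block shared memory.
void eliminate_with_row_cache(std::vector<double>& A, std::vector<double>& rhs) {
    const std::size_t N = rhs.size();
    assert(A.size() == N * N);
    std::vector<double> row(N + 1);  // cached pivot row; last slot holds its RHS
    for (std::size_t np = 0; np + 1 < N; ++np) {
        // Load (or reload) the pivot row into the cache.
        for (std::size_t ms = np; ms < N; ++ms) row[ms] = A[np * N + ms];
        row[N] = rhs[np];
        // Rows below read pivot data from the cache only.
        for (std::size_t ns = np + 1; ns < N; ++ns) {
            const double P = A[ns * N + np] / row[np];
            rhs[ns] -= row[N] * P;
            for (std::size_t ms = np + 1; ms < N; ++ms)
                A[ns * N + ms] -= row[ms] * P;
        }
    }
}
```

The cache holds N + 1 values instead of N*N + N, so the fast-memory footprint grows linearly rather than quadratically with the number of unknowns.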
CFD Results: Forward Step

o Mach 3 flow over a step with reflective boundary on top
o No special treatment at the corner of the step
o MP5 scheme with RK3 using 630,000 cells
CFD Results: Backward Step

o Mach 2.4 shock diffracted from a step
o MP5 scheme with RK3 using 300,000 cells
o Comparison with experiment shows excellent agreement
CFD Results: Rayleigh-Taylor Instability

o Acceleration of a heavy fluid into a lighter fluid
o MP5 scheme with RK3 using 1.6M cells
o Contact discontinuity well resolved; evidence of fine-scale
  instability structure
Cellular Detonation

Test setup:

o Wall spark ignition (P = 40 atm; T = 1500 K) with
  premixed stoichiometric mixture of air
o Contact discontinuity initially disturbed in 2-D simulation
o Maas and Warnatz [4] H2-O2 reaction mechanism

Figure: Test setup schematic (P = 40 atm region; 30 cm dimension marked)

[4] Maas, U. and J. Warnatz (1988). Combust. Flame 74, 53-69.
Cellular Detonation

o Pressure and temperature evolution of the flow field
o Cellular structure develops due to flame front instability
Performance Results: Algorithm 1

How effective is global memory access?
Performance Results: Algorithm 1 vs. 2

o Measurement of the performance of the kinetics solver for
  different numbers of species.
o Grid size is varied because of the global memory limit
Performance Results: CFD

o ADER-WENO shows substantial advantages over MP5 due
  to single-step integration

Figure: CFD performance versus number of elements
Performance

o Speed-up obtained for a larger mechanism (CH4-O2) is
  nearly 40 times
Conclusion and Future Works

Accomplishments:

o Basic CFD framework for fluid simulation with detailed
  chemical kinetics.
o Performance obtained in both cases is very promising: speed-ups
  of up to 60 times for non-reacting flow and up to 40 times for
  reacting flow

Future Works:

o Extension to multi-GPU using MPI
o Collisional-radiative kinetics for partially ionized gas
o MHD simulation for electromagnetic field effects